Abstract

Transferring knowledge across a sequence of related tasks is an
important challenge in reinforcement learning (RL). Despite much
encouraging empirical evidence, there has been little theoretical
analysis. In this paper, we study a class of lifelong RL problems: the
agent solves a sequence of tasks modeled as finite Markov decision
processes (MDPs), each of which is from a finite set of MDPs with the
same state/action sets and different transition/reward functions.
Motivated by the need for cross-task exploration in lifelong learning,
we formulate a novel online coupon-collector problem and give an
optimal algorithm. This allows us to develop a new lifelong RL
algorithm, whose overall sample complexity in a sequence of tasks is
much smaller than that of single-task learning, even if the sequence of
tasks is generated by an adversary. Benefits of the algorithm are
demonstrated in simulated problems, including a recently introduced
human-robot interaction problem.

Introduction 

Transfer learning, the ability to take prior knowledge and use it to
perform well on a new task, is an essential capability of intelligence.
Tasks themselves often involve multiple steps of decision making under
uncertainty. Therefore, lifelong learning across multiple
reinforcement-learning (RL) (Sutton and Barto 1998) tasks, or LLRL, is
of significant interest. Potential applications are broad, from
leveraging information across customers to speeding up robotic
manipulation in new environments. Over the last few decades, there has
been much work on this problem, which predominantly focuses on
providing promising empirical results but with few formal performance
guarantees (e.g., Ring (1997), Wilson et al. (2007), Taylor and Stone
(2009), Schmidhuber (2013), and the many references therein), or on the
offline/batch setting (Lazaric and Restelli 2011), or on multi-armed
bandits (Azar, Lazaric, and Brunskill 2013).

In this paper, we focus on a special case of lifelong reinforcement
learning which captures a class of interesting and challenging
applications. We assume that all tasks, modeled as finite Markov
decision processes or MDPs, have the same state and action spaces, but
may differ in their transition probabilities and reward functions.
Furthermore, the tasks are elements of a finite collection of MDPs that
are initially unknown.¹ Such a setting is particularly motivated by
applications to user personalization, in domains like education, health
care and online marketing, where one can consider each “task” as
interacting with one particular individual, and the goal is to leverage
prior experience to improve performance with later users. Indeed,
partitioning users into several groups with similar behavior has found
uses in various application domains (Chu and Park 2009; Fern et al.
2014; Liu and Koedinger 2015; Nikolaidis et al. 2015): it offers a form
of partial personalization, allowing the system to learn good
interactions with the user more quickly (than learning for each user
separately) but still offering much more personalization than modeling
all individuals as the same.

A critical issue in transfer or lifelong learning is how and when to
leverage information from previous tasks in solving the current one. If
the new task represents a different MDP with a different optimal policy,
then leveraging prior task information may actually result in
substantially worse performance than learning with no prior information,
a phenomenon known as negative transfer (Taylor and Stone 2009).
Intuitively, this is partly because leveraging prior experience can
prevent an agent from visiting states that have different rewards in the
new task and that would be visited under the optimal policy of the new
task. In other words, in lifelong RL, in addition to the exploration
typically needed to obtain optimal policies in single-task RL (i.e.,
single-task exploration), the agent also needs sufficient exploration to
uncover relations among tasks (i.e., task-level transfer).

To this end, the agent faces an online discovery problem: 
the new task may be the same as one of the prior tasks, or may
be a novel one. The agent can treat it as a task that has been 
seen before (therefore transferring prior knowledge to solve 
it), or try to discover whether it is novel. Failing to correctly 
treat a novel task as new, or treating an existing task as the 
same as a prior task, will lead to sub-optimal performance. 

The main contributions are three-fold. First, inspired by the need for
online discovery in LLRL, we formulate and study a novel online
coupon-collector problem (OCCP), providing algorithms with optimal
regret guarantees. These results are of independent interest, given the
wide application of the classic coupon-collector problem. Second, we
propose a novel LLRL algorithm, which essentially is an OCCP algorithm
that uses sample-efficient single-task RL algorithms as a black box.
When solving a sequence of tasks, compared to single-task RL, this LLRL
algorithm is shown to have a substantially lower sample complexity of
exploration, a theoretical measure of learning speed in online RL.
Finally, we provide simulation results on a simple gridworld simulation,
and a simulated human-robot collaboration task recently introduced by
Nikolaidis et al. (2015), in which there exists a finite set of
different (latent) human user types with different preferences over
their desired robot collaboration interaction. Our results illustrate
the benefits and relative advantage of our new approach over prior ones.

¹ Given finite sets of states and actions, MDPs with similar
transition/reward parameters have similar value functions. Thus,
finitely many policies suffice to represent near-optimal policies.

Related Work. There has been substantial interest in lifelong learning
across sequential decision making tasks for decades; e.g., Ring (1997),
Schmidhuber (2013), and White, Modayil, and Sutton (2012). Lifelong RL
is closely related to transfer RL, in which information (or data) from
source MDPs is used to accelerate learning in the target MDP (Taylor and
Stone 2009). A distinctive element in lifelong RL is that every task is
both a target and a source task. Consequently, the agent has to explore
the current task once in a while to allow better knowledge to be
transferred to better solve future tasks; this is the motivation for the
online coupon-collector problem we formulate and study here.

Our setting, of solving MDPs sampled from a finite set, is related to
Konidaris and Doshi-Velez (2014)’s hidden parameter MDPs, which cover
our setting and others where there is a latent variable that captures
key aspects of a task. Wilson et al. (2007) tackle a similar problem
with a hierarchical Bayesian approach to modeling task-generation
processes. Most prior work on lifelong/transfer RL has focused on
algorithmic and empirical innovations, with little theoretical analysis
for online RL. An exception is a two-phase algorithm (Brunskill and Li
2013), which has provably lower sample complexity than single-task RL,
but makes a few critical assumptions. Our setting is more general: tasks
may be selected adversarially, instead of stochastically (Wilson et al.
2007; Brunskill and Li 2013). Consequently, we do not assume a minimum
task sampling probability, or knowledge of the cardinality of the
(latent) set of MDPs. This allows our algorithm to be applied in more
realistic problems such as personalization domains where the number of
user “types” is typically unknown in advance. In addition, Bou Ammar,
Tutunov, and Eaton (2015) recently introduced and provided regret bounds
(as a function of the number of tasks) of a policy-search algorithm for
LLRL. Each task’s policy parameter is represented as a linear
combination of shared latent variables, allowing it to be used in
continuous domains. However, in addition to local optimality guarantees
typical in policy-search methods, lack of sufficient exploration in
their approach may also lead to suboptimal policies.

In addition to the original coupon-collector problem, to be described in
the next section, our online coupon-collector problem is related to
bandit problems (Bubeck and Cesa-Bianchi 2012) that also require
efficient exploration. In bandits every action leads to an observed
loss, while in OCCP only one action has observable loss. Apple tasting
(Helmbold, Littlestone, and Long 2000) has a similar flavor to OCCP, but
with a different structure in the loss matrix; furthermore, its analysis
is in the mistake-bound model that is not suitable here. Langford,
Zinkevich, and Kakade (2002) study an abstract model for exploration,
but their setting assumes a non-decreasing, deterministic reward
sequence, while we allow non-monotonic and stochastic (or even
adversarial) reward sequences. Consequently, an explore-first strategy
is optimal in their setting but not in OCCP. Furthermore, they analyze
competitive ratios, while we focus on excessive loss. Bubeck, Ernst, and
Garivier (2014) tackle a very different problem called “optimal
discovery”, for quick identification of hidden elements assuming access
to different sampling distributions. Finally, compared to the missing
mass problem (McAllester and Schapire 2000), which is about pure
predictions, OCCP involves decision making, and thus requires balancing
exploration and exploitation.

The Online Coupon-Collector Problem 

Motivated by the need for cross-task exploration to discover novel MDPs
in LLRL, we formulate and study a novel problem that is an online
version of the classic Coupon-Collector Problem, or CCP (Von Schelling
1954). Solutions to online CCP play a crucial role in developing a new
lifelong RL algorithm in the next section. Moreover, the problem may be
of independent interest in many disciplines like optimization, biology,
communications, and cache management in operating systems, where CCP has
found important applications (Boneh and Hofri 1997; Berenbrink and
Sauerwald 2009), as well as in other meta-learning problems that require
efficient exploration to uncover cross-task relations.

Formulation 

In the Coupon-Collector Problem, there is a multinomial distribution
$\mu$ over a set $\mathcal{M}$ of $C$ coupon types. In each round, one
type is sampled from $\mu$. Much research has been done to study
probabilistic properties of the (random) time when all $C$ coupons are
first collected, especially its expectation (e.g., Berenbrink and
Sauerwald (2009) and references therein).
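
As a small illustration of the classic problem, the following sketch simulates the collection time and compares its empirical mean with the well-known closed form $C(1 + 1/2 + \cdots + 1/C)$ for the uniform case. The function names and the choice $C = 5$ are illustrative assumptions, not part of the paper.

```python
import random

def rounds_to_collect_all(probs, rng):
    """Draw types i.i.d. from `probs` until every type is seen; return the round count."""
    types = list(range(len(probs)))
    seen = set()
    rounds = 0
    while len(seen) < len(probs):
        seen.add(rng.choices(types, weights=probs)[0])
        rounds += 1
    return rounds

def uniform_expectation(c):
    """Exact expected collection time C * (1 + 1/2 + ... + 1/C) for uniform sampling."""
    return c * sum(1.0 / k for k in range(1, c + 1))

rng = random.Random(0)
C = 5  # illustrative number of coupon types
trials = [rounds_to_collect_all([1.0 / C] * C, rng) for _ in range(20000)]
print("empirical mean:", sum(trials) / len(trials))
print("closed form:   ", uniform_expectation(C))
```

With non-uniform `probs`, the same simulation shows how rare types dominate the collection time, which is the difficulty the online variant below has to cope with.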

In our Online Coupon-Collector Problem, or OCCP, $C = |\mathcal{M}|$ is
unknown. Given a coupon, the learner may probe the type or skip; thus,
$\mathcal{A} = \{P \text{ (“probe”)}, S \text{ (“skip”)}\}$ is the binary
action set. The learner is also given four constants,
$\rho_0 < \rho_1 \le \rho_2 < \rho_3$, specifying the loss matrix $L$ in
Table 1.

Table 1: OCCP loss matrix: rows indicate actions; columns indicate
whether the current item is novel or not. The known constants,
$\rho_0 < \rho_1 \le \rho_2 < \rho_3$, specify costs of actions in
different situations.

The game proceeds as follows. Initially, the set of discovered items
$\mathcal{M}_1$ is $\emptyset$. For round $t = 1, 2, \ldots, T$:

• Environment selects a coupon $M_t \in \mathcal{M}$ of unknown type.

• The learner chooses action $A_t \in \mathcal{A}$, and suffers loss
$L_t$ as specified in the loss matrix of Table 1. The learner observes
$L_t$ if $A_t = P$, and $\perp$ (“no observation”) otherwise.

• If $A_t = P$, $\mathcal{M}_{t+1} \leftarrow \mathcal{M}_t \cup \{M_t\}$;
else $\mathcal{M}_{t+1} \leftarrow \mathcal{M}_t$.
At the beginning of round $t$, define the history up to $t$ as

$H_t := (\mathcal{M}_1, A_1, L_1, \mathcal{M}_2, A_2, L_2, \ldots, \mathcal{M}_{t-1}, A_{t-1}, L_{t-1})$.

An algorithm is admissible if it chooses actions $A_t$ based on $H_t$
and possibly an external source of randomness. We distinguish two
settings. In the stochastic setting, the environment samples $M_t$ from
an unknown distribution $\mu$ over $\mathcal{M}$ in an i.i.d.
(independent and identically distributed) fashion. In the adversarial
setting, the sequence $(M_t)_t$ can be generated by an adversary in an
arbitrary way that depends on $H_t$.

If the learner knew the type of $M_t$, the optimal strategy would be to
choose $A_t = P$ if $M_t \notin \mathcal{M}_t$, and $A_t = S$ otherwise.
The loss is $\rho_2$ if $M_t$ is a new type, and $\rho_0$ otherwise.
Hence, after $T$ rounds, if $C^* \le C$ is the number of distinct items
in the sequence $(M_t)_t$, this ideal strategy has the loss:

$$L^*(T) := \rho_2 C^* + \rho_0 (T - C^*). \tag{1}$$

The challenge, of course, is that the learner does not know $M_t$’s type
before choosing $A_t$. She thus has to balance exploration (taking
$A_t = P$ to see if $M_t$ is novel) and exploitation (taking $A_t = S$
to yield small loss $\rho_0$ if it is likely that
$M_t \in \mathcal{M}_t$). Clearly, over- and under-exploration result in
suboptimal strategies. We are therefore interested in finding algorithms
$\mathbf{A}$ with as small a cumulative loss as possible.

Formally, an OCCP algorithm $\mathbf{A}$ is a possibly stochastic
function that maps histories to actions: $A_t = \mathbf{A}(H_t)$. The
total $T$-round loss suffered by $\mathbf{A}$ is
$L(\mathbf{A}, T) := \sum_{t=1}^{T} L_t$. The $T$-round regret of an
algorithm $\mathbf{A}$ is
$\tilde{R}(\mathbf{A}, T) := L(\mathbf{A}, T) - L^*(T)$, and its
expectation is denoted by
$\bar{R}(\mathbf{A}, T) := \mathbb{E}[\tilde{R}(\mathbf{A}, T)]$, where
the expectation is taken with respect to any randomness in the
environment as well as in $\mathbf{A}$.
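
The definitions above can be made concrete with a short sketch that replays one OCCP run and reports the total loss, the ideal loss of Equation (1), and the regret. The text fixes two entries of the loss matrix (probing a novel item costs $\rho_2$, skipping a known item costs $\rho_0$); the assignment of $\rho_1$ to probing a known item and $\rho_3$ to skipping a novel item is an assumption made here, as are the function name and the numeric constants.

```python
def occp_regret(items, actions, rho):
    """Total loss, ideal loss L*(T), and regret for one OCCP run.

    items   -- sequence of coupon types M_1..M_T
    actions -- sequence of 'P' (probe) or 'S' (skip), same length as items
    rho     -- tuple (rho0, rho1, rho2, rho3)
    """
    rho0, rho1, rho2, rho3 = rho
    discovered = set()
    total = 0.0
    for m, a in zip(items, actions):
        novel = m not in discovered
        if a == "P":
            total += rho2 if novel else rho1  # (P, known) = rho1 is an assumption
            discovered.add(m)                 # probing adds the type to the discovered set
        else:
            total += rho3 if novel else rho0  # (S, novel) = rho3 is an assumption
    c_star = len(set(items))
    ideal = rho2 * c_star + rho0 * (len(items) - c_star)  # Equation (1)
    return total, ideal, total - ideal

items = ["a", "b", "a", "a", "b"]
rho = (0, 1, 2, 5)  # illustrative constants
for name, actions in [("always probe", "PPPPP"), ("oracle", "PPSSS"), ("never probe", "SSSSS")]:
    print(name, occp_regret(items, actions, rho))
```

Note that a type that is only ever skipped stays novel, so a learner that never probes keeps paying the largest loss on every round.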

Explore-First Strategy 

In the stochastic case, it can be shown that if an algorithm chooses P
for a total of $E$ times, its expected regret is smallest if these
actions are chosen at the very beginning. The resulting strategy is
sometimes called Explore-First, or ExpFirst for short, in the
multi-armed bandit literature.

With knowledge of $\mu_m := \min_{M \in \mathcal{M}} \mu(M)$, one may set
$E$ so that all types in $\mathcal{M}$ will be discovered in the first
(probing) phase consisting of $E$ rounds with high probability. This
results in a high-probability regret bound, which can be used to
establish an expected regret bound, as summarized below. A proof is
given in Appendix A.

Proposition 1. For any $\delta \in (0,1)$, let $E = [\ldots] \ln [\ldots]$,
where $\mu_m = \min_{M \in \mathcal{M}} \mu(M)$. Then, with probability
$1 - \delta$, $\tilde{R}(\text{ExpFirst}, T) \le [\ldots] \ln [\ldots]$.
Moreover, if $E = [\ldots] \ln [\ldots]$, then the expected regret is
$\bar{R}(\text{ExpFirst}, T) \le [\ldots] (\ln [\ldots] + 1)$.
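
To illustrate how the length of the probing phase can be tied to the smallest sampling probability, here is a standard union-bound calculation: a type with probability at least $\mu_m$ is missed in $E$ i.i.d. draws with probability at most $(1-\mu_m)^E \le e^{-\mu_m E}$, so $E \ge \frac{1}{\mu_m}\ln\frac{C}{\delta}$ suffices to see all $C$ types with probability $1-\delta$. This is a generic sketch and not necessarily the exact choice of $E$ in Proposition 1; the helper names and numbers are hypothetical.

```python
import math

def probing_rounds(mu_min, num_types, delta):
    """Smallest integer E with num_types * exp(-mu_min * E) <= delta (union bound)."""
    return math.ceil(math.log(num_types / delta) / mu_min)

def miss_probability_bound(mu_min, num_types, rounds):
    """Union bound on the chance that some type is never sampled in `rounds` draws."""
    return num_types * (1.0 - mu_min) ** rounds

# Illustrative values: 4 types, rarest type sampled with probability 0.07.
E = probing_rounds(mu_min=0.07, num_types=4, delta=0.05)
print(E, miss_probability_bound(0.07, 4, E))
```

The $1/\mu_m$ factor is what makes a pure explore-first phase expensive when some type is rare.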


Forced-Exploration Strategy

While ExpFirst is effective in stochastic OCCP, it requires knowledge of
$\mu_m$, and the probing phase may be too long for small $\mu_m$.
Moreover, in many scenarios, the sampling process may be non-stationary
(e.g., different types of users may use the Internet at different times
of the day) or even adversarial (e.g., an attacker may present certain
MDPs in earlier tasks in LLRL to cause an algorithm to perform poorly in
future ones). We now study a more general algorithm, ForcedExp, based on
forced exploration, and prove a regret upper bound. The next subsection
will present a matching lower bound, indicating the algorithm’s
optimality.

Before the game starts, the algorithm chooses a fixed sequence of
“probing rates”: $\eta_1, \ldots, \eta_T \in [0,1]$. In round $t$, it
chooses actions accordingly: $\mathbb{P}\{A_t = S\} = 1 - \eta_t$ and
$\mathbb{P}\{A_t = P\} = \eta_t$. The main result in this subsection is
as follows, proved in Appendix B.

Theorem 2. Let $\eta_t = t^{-\alpha}$ (polynomial decaying rate) for
some parameter $\alpha \in (0,1)$. Then, for any given
$\delta \in (0,1)$,

$$\tilde{R}(\text{ForcedExp}, T) \le C^* \rho_3 [\ldots] \tag{2}$$

with probability $1 - \delta$. The expected regret is
$\bar{R}(\text{ForcedExp}, T) \le C^* \rho_3 T^{\alpha} + [\ldots]$. Both
bounds are $O(\sqrt{T})$ by choosing $\alpha = 1/2$.

The results show that ForcedExp eventually performs as well as the
hypothetical optimal strategy that knows the type of $M_t$ in every
round $t$, no matter how $M_t$ is generated. Moreover, the per-round
regret decays on the order of $1/\sqrt{T}$, which we will show to be
optimal shortly.

Lower Bounds

The main result in this subsection, Theorem 3, shows that the
$O(\sqrt{T})$ regret bound for ForcedExp is essentially not improvable,
in terms of $T$-dependence, even in the stochastic case. The idea of the
proof, given in Appendix C, is to construct a hard instance of
stochastic OCCP. On one hand, $\Omega(\sqrt{T})$ regret is suffered
unless all $C$ types are discovered. On the other hand, most of the
types have small probability $\mu_m$ of being sampled, requiring the
learner to take the exploration action P many times to discover all $C$
types. The lower bound follows from an appropriate value of $\mu_m$.

Theorem 3. There exists an OCCP where every admissible algorithm has an
expected regret of $\Omega(\sqrt{T})$, and for sufficiently small
$\delta$, the regret is $\Omega(\sqrt{T})$ with probability $1 - \delta$.

Note our goal here is to find a matching lower bound in terms of $T$. We
do not attempt to match dependence on other quantities like $C$, which
are often less important than $T$.

The lower bound may seem to contradict ExpFirst’s logarithmic upper
bound in Proposition 1. However, that upper bound is problem specific
and requires knowledge of $\mu_m$. Without knowing $\mu_m$, the
algorithm has to choose $\mu_m = O(1/\sqrt{T})$ in the probing phase;
otherwise, there is a chance it may not be able to discover a type $M$
with $\mu(M) = \Omega(1/\sqrt{T})$, suffering $\Omega(\sqrt{T})$ regret.
With this value of $\mu_m$, the bound in Proposition 1 has an
$O(\sqrt{T})$ dependence.
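
To make the forced-exploration strategy and its $\sqrt{T}$ scaling concrete, the following sketch simulates an OCCP learner that probes in round $t$ with probability $t^{-\alpha}$ and reports its regret divided by $\sqrt{T}$. It assumes a polynomially decaying probing rate $\eta_t = t^{-\alpha}$, a uniform i.i.d. item sequence, and the same assumed loss-matrix layout as before ($\rho_1$ for probing a known item, $\rho_3$ for skipping a novel one); all names and constants are illustrative.

```python
import random

def forced_exp_regret(items, alpha, rho, rng):
    """Regret of forced exploration with probing rate t^(-alpha) on one item sequence."""
    rho0, rho1, rho2, rho3 = rho
    discovered = set()
    total = 0.0
    for t, m in enumerate(items, start=1):
        novel = m not in discovered
        if rng.random() < t ** (-alpha):      # probe with probability eta_t = t^(-alpha)
            total += rho2 if novel else rho1  # (P, known) = rho1 is an assumption
            discovered.add(m)
        else:
            total += rho3 if novel else rho0  # (S, novel) = rho3 is an assumption
    c_star = len(set(items))
    ideal = rho2 * c_star + rho0 * (len(items) - c_star)  # ideal loss, Equation (1)
    return total - ideal

rng = random.Random(0)
rho = (0.0, 1.0, 1.0, 2.0)  # illustrative loss constants
for T in (1000, 4000, 16000):
    items = [rng.randrange(4) for _ in range(T)]  # 4 types, uniform (illustrative)
    runs = [forced_exp_regret(items, 0.5, rho, rng) for _ in range(50)]
    print(T, sum(runs) / len(runs) / T ** 0.5)
```

If the regret grows like $\sqrt{T}$ for $\alpha = 1/2$, the printed ratio should stay roughly constant as $T$ increases; changing `alpha` shows the trade-off between probing too often and discovering new types too slowly.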


Application to PAC-MDP Lifelong RL 

Building on the OCCP results established in the previous 
section, we now turn to lifelong RL. 



Preliminaries 

We consider RL (Sutton and Barto 1998) in discrete-time, finite MDPs
specified by a five-tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$,
where $\mathcal{S}$ is the set of states ($S := |\mathcal{S}|$),
$\mathcal{A}$ the set of actions ($A := |\mathcal{A}|$), $P$ the
transition probability function,
$R : \mathcal{S} \times \mathcal{A} \to [0,1]$ the reward function, and
$\gamma \in (0,1)$ the discount factor. Initially, $P$ and $R$ are
unknown. Given a policy $\pi : \mathcal{S} \to \mathcal{A}$, its state
and state-action value functions are denoted by $V^{\pi}(s)$ and
$Q^{\pi}(s,a)$, respectively. The optimal value functions are $V^*$ and
$Q^*$. Finally, let $V_{\max}$ be a known upper bound of $V^*(s)$, which
is at most $1/(1-\gamma)$ but can be much smaller.

Various frameworks have been studied to capture the learning speed of
single-task online RL algorithms, such as regret analysis (Jaksch,
Ortner, and Auer 2010). Here, we focus on another useful notion known as
sample complexity of exploration (Kakade 2003), or sample complexity for
short. Some of our results, especially those related to cross-task
exploration and OCCP, may also find use in regret analysis.

Any RL algorithm $\mathbf{A}$ can be viewed as a nonstationary policy,
whose value functions, $V^{\mathbf{A}}$ and $Q^{\mathbf{A}}$, are
defined similarly to the stationary-policy case. When $\mathbf{A}$ is
run on an unknown MDP, we call it a mistake at step $t$ if the algorithm
chooses a suboptimal action, namely,
$V^*(s_t) - V^{\mathbf{A}_t}(s_t) > \epsilon$.

We define the sample complexity of $\mathbf{A}$, $\zeta(\epsilon, \delta)$,
as the maximum number of mistakes, with probability at least
$1 - \delta$. If $\zeta$ is polynomial in $S$, $A$, $1/(1-\gamma)$,
$1/\epsilon$, and $\ln(1/\delta)$, then $\mathbf{A}$ is called PAC-MDP
(Strehl, Li, and Littman 2009).

Most PAC-MDP algorithms (Kearns and Singh 2002; Brafman and Tennenholtz
2002; Strehl, Li, and Littman 2009) work by assigning maximum reward to
state-action pairs that have not been visited often enough to obtain
reliable transition/reward parameters. The Finite-Model-RL algorithm
used for LLRL (Brunskill and Li 2013) leverages a similar idea, where
the current RL task is close to one of a finite set of known MDP models.
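
The optimism principle mentioned above can be sketched as value iteration on a model in which under-visited state-action pairs are treated as maximally rewarding. This is a minimal Rmax-style illustration, not the implementation of any cited algorithm: it assigns the maximal value `v_max` directly to unknown pairs (equivalent in spirit to assigning them maximum reward), and the function name, data layout, and threshold `m` are assumptions.

```python
def optimistic_values(counts, p_hat, r_hat, m, gamma, v_max, iters=200):
    """Value iteration on an optimistic model (Rmax-style sketch).

    counts[s][a] -- number of visits to (s, a)
    p_hat[s][a]  -- list of estimated next-state probabilities
    r_hat[s][a]  -- estimated mean reward
    Pairs visited fewer than m times are "unknown" and get value v_max.
    """
    n_states = len(counts)
    v = [0.0] * n_states
    for _ in range(iters):
        new_v = []
        for s in range(n_states):
            q_values = []
            for a in range(len(counts[s])):
                if counts[s][a] < m:
                    q_values.append(v_max)  # optimism for unknown pairs
                else:
                    backup = sum(p * v[s2] for s2, p in enumerate(p_hat[s][a]))
                    q_values.append(r_hat[s][a] + gamma * backup)
            new_v.append(max(q_values))
        v = new_v
    return v

# One state, two self-looping actions; action 1 has never been tried.
print(optimistic_values([[5, 0]], [[[1.0], [1.0]]], [[0.5, 0.0]], m=1, gamma=0.9, v_max=10.0))
```

Acting greedily with respect to such values drives the agent toward unknown pairs until they have been visited `m` times, after which the empirical estimates take over.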

Cross-task Exploration in Lifelong RL 

In lifelong RL, the agent seeks to maximize total reward as it acts in a
sequence of $T$ tasks. If the tasks are related, learning speed is
expected to improve by transferring knowledge obtained from prior tasks.
Following previous work (Wilson et al. 2007; Brunskill and Li 2013), and
motivated by many applications (Chu and Park 2009; Fern et al. 2014;
Nikolaidis et al. 2015; Liu and Koedinger 2015), we assume a finite set
$\mathcal{M}$ of possible MDPs. The agent solves a sequence of $T$
tasks, with $M_t \in \mathcal{M}$ denoting the (unknown) MDP of task
$t$. Before solving the task, the agent does not know whether or not
$M_t$ has been encountered before. It then acts in $M_t$ for $H$ steps,
where $H$ is given, and can take advantage of any information extracted
from solving prior tasks $\{M_1, \ldots, M_{t-1}\}$. Our setting is more
general, allowing tasks to be chosen adversarially, in contrast to prior
work that focused on the stochastic case (Wilson et al. 2007; Brunskill
and Li 2013).

In comparison to single-task RL, performing additional exploration in a
task (potentially beyond that needed for reward maximization in the
current task) may be advantageous in the LLRL setting, since such
information may

Algorithm 1 Lifelong RL based on ForcedExp

Input: $\alpha \in (0,1)$, $m \in \mathbb{N}$, $L \in \mathbb{N}$
Initialize $\hat{\mathcal{M}} \leftarrow \emptyset$
for $t = 1, 2, \ldots$ do
    Generate a random number $\xi \sim \text{Uniform}(0,1)$
    if $\xi < t^{-\alpha}$ (probing to discover new MDP) then
        Run PAC-Explore with parameters $m$ and $L$ to fully explore
            all states in $M_t$, so that every action is taken in every
            state at least $m$ times.
        After PAC-Explore finishes, choose actions by an optimal policy
            of the empirical model $\hat{M}_t$.
        if for all existing models $M \in \hat{\mathcal{M}}$, $\hat{M}_t$
            has non-overlapping confidence intervals in some
            state-action pair’s transition/reward parameters then
            $\hat{\mathcal{M}} \leftarrow \hat{\mathcal{M}} \cup \{\hat{M}_t\}$
        end if
    else
        Run Finite-Model-RL with $\hat{\mathcal{M}}$
    end if
end for
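
The control flow of Algorithm 1 can be sketched in a few lines of Python. This is only a skeleton under stated assumptions: the probing probability in round $t$ is taken to be $t^{-\alpha}$ (the polynomially decaying rate of ForcedExp), and `pac_explore`, `finite_model_rl`, and `is_novel` are hypothetical callables standing in for the single-task components, which are not implemented here.

```python
import random

def lifelong_forced_exp(tasks, alpha, pac_explore, finite_model_rl, is_novel, rng):
    """Meta-loop of Algorithm 1; the three callables are hypothetical stand-ins."""
    known_models = []  # the discovered set of models
    probed = []        # record of which tasks were probed
    for t, task in enumerate(tasks, start=1):
        if rng.random() < t ** (-alpha):         # probe with probability t^(-alpha)
            model = pac_explore(task)            # full exploration -> empirical model
            if is_novel(model, known_models):    # e.g. disjoint confidence intervals
                known_models.append(model)
            probed.append(True)
        else:
            finite_model_rl(task, known_models)  # exploit previously discovered models
            probed.append(False)
    return known_models, probed

# Toy usage: tasks are labels, and "exploring" a task simply reveals its label.
models, probed = lifelong_forced_exp(
    tasks=["a", "b", "a", "c", "b"],
    alpha=0.5,
    pac_explore=lambda task: task,
    finite_model_rl=lambda task, known: None,
    is_novel=lambda model, known: model not in known,
    rng=random.Random(0),
)
print(models, probed)
```

Keeping the single-task components behind callables mirrors the meta-algorithm view: they can be swapped without touching the probing schedule.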


help the agent perform better in future tasks. Indeed, prior work
(Brunskill and Li 2013) has demonstrated that learning the latent
structure of the possible MDPs that may be encountered can lead to
significant reductions in the sample complexity in later tasks. We can
realize this benefit by explicitly identifying this latent shared
structure.

This observation inspired our abstraction of OCCP, whose relation to
LLRL we now formalize. Here, the probing action (P) corresponds to doing
full exploration in the current task, while the skipping action (S)
corresponds to applying transferred knowledge to accelerate learning. We
use our OCCP ForcedExp algorithm, resulting in Algorithm 1; overloading
terminology, we refer to this LLRL algorithm as ForcedExp. In contrast,
the two-phase LLRL algorithm of Brunskill and Li (2013) essentially uses
ExpFirst to discover new MDPs, and is referred to as ExpFirst.

At round $t$, if probing is to happen, ForcedExp performs PAC-Explore
(Guo and Brunskill 2015), outlined in Algorithm 2 of Appendix D, to do
full exploration of $M_t$ to get an accurate empirical model
$\hat{M}_t$. To determine whether $M_t$ is new, the algorithm checks if
$\hat{M}_t$’s parameters’ confidence intervals are disjoint from those
of every $M \in \hat{\mathcal{M}}$ in at least one state-action pair. If
so, we add $\hat{M}_t$ to the set $\hat{\mathcal{M}}$.
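
The novelty test described above can be illustrated with a simple interval-overlap check. This sketch assumes each model stores, per state-action pair, a list of parameter estimates (transition probabilities followed by the mean reward) and a single confidence half-width per model; the data layout, the function names, and the use of one shared radius are simplifying assumptions, not the paper's construction.

```python
def intervals_disjoint(center_a, center_b, radius_a, radius_b):
    """True if [center_a +/- radius_a] and [center_b +/- radius_b] do not overlap."""
    return abs(center_a - center_b) > radius_a + radius_b

def is_novel(model, radius, known_models):
    """model: dict (s, a) -> list of parameter estimates.
    known_models: list of (model, radius) pairs (hypothetical layout).
    The new model is novel if it is separated from EVERY known model
    in at least one parameter of at least one state-action pair."""
    for other, other_radius in known_models:
        separated = any(
            intervals_disjoint(x, y, radius, other_radius)
            for key in model
            for x, y in zip(model[key], other[key])
        )
        if not separated:
            return False  # statistically indistinguishable from a known model
    return True

new = {("s", "a"): [1.0, 0.0, 0.75]}
far = ({("s", "a"): [1.0, 0.0, 0.99]}, 0.05)
near = ({("s", "a"): [1.0, 0.0, 0.80]}, 0.05)
print(is_novel(new, 0.05, [far]), is_novel(new, 0.05, [far, near]))
```

The check is only as reliable as the confidence intervals, which is why full exploration (enough visits to every state-action pair) precedes it.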

If probing is not to happen, the agent assumes
$M_t \in \hat{\mathcal{M}}$, and follows the Finite-Model-RL algorithm
(Brunskill and Li 2013), which is an extension of Rmax to work with
finitely many MDP models. With Finite-Model-RL, the amount of
exploration scales with the number of models, rather than the number of
state-action pairs. Therefore, the algorithm gains in sample complexity
by reducing unnecessary exploration from transferring prior knowledge,
if the current task is already in $\hat{\mathcal{M}}$.

Note that Algorithm 1 is a meta-algorithm, where single-task-RL
components like PAC-Explore and Finite-Model-RL may be replaced by
similar algorithms.





Remark. ForcedExp may appear naive or simplistic, as it decides whether
to probe a new task before seeing any data in $M_t$. It is easy to allow
the algorithm to switch from non-probing (S) to probing (P) while acting
in $M_t$, whenever $M_t$ appears different from all MDPs in
$\hat{\mathcal{M}}$ (again, by comparing confidence intervals of model
parameters). Although this change can be beneficial in practice, it does
not improve worst-case sample complexity: if we are in the non-probing
case running Finite-Model-RL in an MDP not in $\hat{\mathcal{M}}$, there
is no guarantee to identify the current task as a new one. This is
because by assuming that the current MDP is one of the models in
$\hat{\mathcal{M}}$, the learner may follow a policy that never
sufficiently explores informative state-action pair(s) that could have
revealed the current MDP is novel. Therefore, from a theoretical
(worst-case) perspective, it is not critical to allow the algorithm to
switch to the probing mode.

Similarly, switching from probing to non-probing in the middle of a task
is in general not helpful, as shown in the following example. Let
$\mathcal{S} = \{s\}$ contain a single state, so $P(s|s,a) = 1$ and MDPs
in $\mathcal{M}$ differ only in the reward function. Suppose at round
$t$, the learner has discovered a set of MDPs $\hat{\mathcal{M}}$ from
the past, and chooses to probe, thus running PAC-Explore. After some
steps in $M_t$, if the learner switches to non-probing before trying
every action $m$ times in all states, there is a risk of
under-exploration: $M_t$ may be a new MDP not in $\hat{\mathcal{M}}$; it
has the same rewards on optimal actions for some
$M \in \hat{\mathcal{M}}$, but has even higher reward for another action
that is not optimal for any $M' \in \hat{\mathcal{M}}$. By terminating
exploration too early, the learner may fail to identify the optimal
action in $M_t$, ending up with a poor policy.

Sample-Complexity Analysis 

This section gives a sample-complexity analysis for Algorithm 1. For
convenience, we use $\theta_M$ to denote the dynamics of an MDP
$M \in \mathcal{M}$: for each $(s,a)$, $\theta_M(\cdot|s,a)$ is an
$(S+1)$-dimensional vector, with the first $S$ components giving the
transition probabilities to corresponding next states, $P(s'|s,a)$, and
the last component the average immediate reward, $R(s,a)$. The model
difference in $(s,a)$ between $M$ and $M'$, denoted
$\|\theta_M(\cdot|s,a) - \theta_{M'}(\cdot|s,a)\|$, is the
$\ell_2$-distance between the two vectors. Finally, we let $N$ be an
upper bound on the number of next states in the transition models in all
MDPs $M \in \mathcal{M}$; note that $N$ is no larger than $S$ but can be
much smaller in many problems.
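
As a small worked example of the model difference just defined, the sketch below computes the $\ell_2$-distance between two parameter vectors and checks whether two models differ by more than a given gap in some state-action pair. The dictionary layout and the function names are assumptions made for illustration.

```python
import math

def model_difference(theta_m, theta_m2):
    """l2-distance between two (S+1)-dimensional parameter vectors for one (s, a)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(theta_m, theta_m2)))

def separated_by(model_a, model_b, gap):
    """True if some state-action pair's parameter vectors differ by more than `gap`."""
    return any(model_difference(model_a[k], model_b[k]) > gap for k in model_a)

# Two states: vectors hold P(s'|s,a) for both next states, then the mean reward.
m1 = {("s0", "a"): [0.5, 0.5, 1.0]}
m2 = {("s0", "a"): [0.5, 0.5, 0.0]}
print(model_difference(m1[("s0", "a")], m2[("s0", "a")]), separated_by(m1, m2, 0.5))
```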

The following assumptions are made in the analysis:

1. There exists a known quantity $\Gamma > 0$ such that for every two
distinct MDPs $M, M' \in \mathcal{M}$, there exists some $(s,a)$ so that
$\|\theta_M(\cdot|s,a) - \theta_{M'}(\cdot|s,a)\| > \Gamma$;

2. There is a known diameter $D$, such that for any $M \in \mathcal{M}$
and any states $s$ and $s'$, there is a policy $\pi$ that takes an agent
from $s$ to $s'$ in at most $D$ steps on average;

3. There are $H \ge H_0$ steps to solve each task $M_t$, where
$H_0 = O(SAN \log [\ldots])$.

The first assumption requires that two distinct MDPs differ by a
sufficient amount in their dynamics in at least one state-action pair,
and is made for convenience to encode prior knowledge about $\Gamma$.
Note that if $\Gamma$ is not known beforehand, one can set $\Gamma$ to
$\Gamma_0 = O(\epsilon(1-\gamma)/(\sqrt{N} V_{\max}))$: if two MDPs
differ by no more than $\Gamma_0$ in every state-action pair, an
$\epsilon$-optimal policy in one MDP will be an $O(\epsilon)$-optimal
policy in the other. The second and third assumptions are the major ones
needed in our analysis. The diameter $D$, introduced by Jaksch, Ortner,
and Auer (2010), is typically not needed in single-task
sample-complexity analysis, but it seems nontrivial to avoid in a
lifelong learning setting. Without the diameter or the long-horizon
assumption, a learner can get stuck in a subset of states that prevent
it from identifying the current MDP. In such situations, it is unclear
how the learner can reliably transfer knowledge to better solve future
tasks.

With these assumptions, the main result is as follows. Note that it is
possible to use refined single-task analysis such as Lattimore and
Hutter (2012) to get better constants for $\rho_0$ and $\rho_3$ below.
We defer that to future work, and instead focus on showing the benefits
of lifelong learning.

Theorem 4. Let Algorithm 1 with proper choices of parameters be run on a
sequence of $T$ tasks, each from a set $\mathcal{M}$ of $C$ MDPs. Then,
with probability $1 - \delta$, the number of steps in which the
algorithm is not $\epsilon$-optimal across all $T$ tasks is
$\tilde{O}(\rho_0 T + C \rho_3 \sqrt{T} \ln \frac{C}{\delta})$, where
$\rho_0 = CD/\Gamma^2$ and $\rho_3 = H$.

While single-task RL typically has a per-task sample complexity that at
least scales linearly with $SA$, Algorithm 1 converges to a per-task
sample complexity of $O(\rho_0)$, which is often much lower.
Furthermore, a bound on the expected sample complexity can be obtained
in a similar way, by the corresponding expected-regret bound in
Theorem 2. Intuitively, in the OCCP setting, we quantified the loss
(equivalently, regret); in LLRL, the loss corresponds to the number of
non-$\epsilon$-optimal steps, and so a loss bound translates directly
into a sample-complexity bound.

The proof (Appendix E) proceeds by analyzing the sample complexity
bounds for all four possible cases (corresponding to the four entries in
the OCCP loss matrix in Table 1) when solving $M_t$, and then combining
them with Theorem 2 to yield the desired results. A key step is to
ensure that when probing happens, the type of $M_t$ will be discovered
successfully with high probability. This is achieved by a couple of key
technical lemmas below, which also elucidate where our assumptions are
used in the analysis.

The first lemma ensures all state-actions can be visited sufficiently
often in finite steps, when the MDP has a small diameter. For
convenience, define $H_0(m) := O(SADm)$.

Lemma 5. For a given MDP, PAC-Explore with input $m \ge m_0$ and
$L = SD$ will visit all state-action pairs at least $m$ times in no more
than $H_0(m)$ steps with probability $1 - \delta$, where
$m_0 = O(N D^2 \log [\ldots])$ is some constant.

The second lemma establishes the fact that when PAC-Explore is run on a
sequence of $T$ tasks, with high probability, it successfully infers
whether $M_t$ has been included in $\hat{\mathcal{M}}$, for every $t$.
This result is a consequence of Lemma 5 and the assumption involving
$\Gamma$.

Lemma 6. With input parameters $H \ge H_0(m)$ and
$m = 72 N \log [\ldots]$ in Algorithm 1, the following holds with
probability $1 - 2\delta$: for every task in the sequence, the algorithm
detects it is a new task if and only if the corresponding MDP has not
been seen before.




Figure 1: Gridworld: nonstationary task selection. 10-task 
smoothed running average of reward per task with 1 std error bars. 

Experiments 

Our simulation results illustrate that our lifelong RL setting can
capture interesting domains, and demonstrate the benefit of our
introduced approach over a prior algorithm with formal sample-complexity
guarantees (Brunskill and Li 2013) that is based on ExpFirst. Due to
space limitations, full details are provided in Appendix F.

Gridworld. We first consider a simple 5 by 5 stochastic gridworld domain
with 4 distinct MDPs to illustrate the salient properties of ForcedExp.
In each of the 4 MDPs one corner offers high reward (sampled from a
Bernoulli with parameter 0.75) and all other rewards are 0. In MDP 4,
the same corner as in MDP 3 is rewarding, and in addition the opposite
corner offers a Bernoulli reward with parameter 0.99.

In the stochastic setting when all tasks are sampled with equal
probability, we compared ExpFirst, ForcedExp and HMTL, a Bayesian
hierarchical multi-task RL algorithm (Wilson et al. 2007). As expected,
all approaches did well in this setting. We next focus on comparing
ExpFirst and ForcedExp, which have finite sample guarantees.

We first consider tasks sampled from nonstationary distributions.
Across 100 tasks all 4 MDPs have identical frequencies,
but an adversary chooses to select only from MDPs 1-3
during the first (probing-only) phase of ExpFirst, before
switching to MDP 4 for 25 tasks, and then switching back to
randomly selecting among the first three MDPs. In MDP 4 the
agent can obtain rewards similar to those of MDP 1 by using
the same policy as for MDP 1, but can obtain higher rewards
if it explicitly explores to discover the state with higher
reward. ForcedExp will randomly probe MDP 4, thus identifying
this new optimal policy, which is why it eventually picks up
the new MDP and obtains higher reward (see Figure 1). ExpFirst
sometimes successfully infers that the task belongs to a new
MDP, but only if it happens to encounter the state that
distinguishes MDPs 1 and 4. This illustrates the benefit of
continued active exploration in nonstationary or adversarial
settings.
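
The adversarial task ordering described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the `n_probe` parameter (the length of ExpFirst's probing-only phase, which the text does not give), and the use of a shuffled pool for MDPs 1-3 are all hypothetical choices, not the authors' experimental code.

```python
import random

def adversarial_task_sequence(n_probe, rng, n_tasks=100, n_mdps=4, block=25):
    """Build a task sequence in which the last MDP appears only in one
    contiguous block placed right after the first `n_probe` tasks.

    Assumption: every MDP appears n_tasks / n_mdps times overall, so all
    MDPs have identical frequencies across the whole sequence.
    """
    per_mdp = n_tasks // n_mdps
    # Pool of tasks drawn from MDPs 1..n_mdps-1, in random order.
    others = [k for k in range(1, n_mdps) for _ in range(per_mdp)]
    rng.shuffle(others)
    # Probing phase sees only MDPs 1-3; then MDP 4 for `block` tasks;
    # then back to the remaining tasks from MDPs 1-3.
    return others[:n_probe] + [n_mdps] * block + others[n_probe:]
```

Calling `adversarial_task_sequence(30, random.Random(0))` yields 100 task labels in which MDP 4 never occurs during the first 30 tasks.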

Simulated Human-Robot Collaboration. We next consider
a more interesting human-robot collaboration problem
studied by Nikolaidis et al. (2015). In this work, the authors
learned 4 models of user types based on prior data collected
about a paired interaction task in which a human collaborates
with a robot to paint a box. Using these types as a latent
state in a mixed-observability MDP enabled significant
improvements over not modeling such types in an experiment
with real human-robot collaborations.

Table 2: Average per-task reward (and std. deviation) in each phase
and overall. Gains with statistical significance are highlighted.

             Phase 1         Phase 2         Overall (80 tasks)
ExpFirst     18305 (1609)    19428 (1960)    18543 (1683)
ForcedExp    18745 (482)     19012 (1904)    18801 (1923)

In our LLRL simulation each task was randomly sampled
from the 4 MDP models learned by Nikolaidis et al. (2015).
This domain was much larger than our gridworld environment,
involving 605 states and 27 actions. It is typical in
such personalization problems that not all user types have
the same frequency. Here, we chose the sampling distribution
$\mu = (0.07, 0.31, 0.31, 0.31)$. The length of ExpFirst's
initial probing period is dominated by $1/\mu_m$, where
$\mu_m = \min_M \mu(M)$. Experiments were repeated for 30 runs,
each consisting of 80 tasks.

The long probing phase of ExpFirst is costly, especially
if the total number of tasks is small, since too much time
is spent on discovering new MDPs. This is shown in Table 2,
where ForcedExp demonstrates a significant advantage by
leveraging past experience much earlier than ExpFirst,
leading to significantly higher reward both during phase 1
and overall (Mann-Whitney U test, p < 0.001 in both cases).
Of course, ExpFirst will eventually exhibit near-optimal
performance in its second (non-probing) phase, whereas
ForcedExp will continue probing with diminishing probability.
However, ForcedExp can exhibit substantial jump-start
benefit when the underlying MDPs are drawn from a
stationary but nonuniform distribution.

These results suggest that ForcedExp achieves comparable
or substantially better performance than prior methods,
especially in nonuniform or nonstationary LLRL problems.

Conclusions 

In this paper, we consider a class of lifelong RL problems
that capture a broad range of interesting applications. Our
work emphasizes the need for efficient cross-task exploration
that is unique in lifelong learning. This led to a novel
online coupon-collector problem, for which we give optimal
algorithms with matching upper and lower regret bounds.
With this tool, we develop a new lifelong RL algorithm,
and analyze its total sample complexity across a sequence
of tasks. Our theory quantifies how much gain is obtained
by lifelong learning, compared to single-task learning, even
if the tasks are adversarially generated. The algorithm was
empirically evaluated in two simulated problems, including
a simulated human-robot collaboration task, demonstrating
its relative strengths compared to prior work.

In the future, we are interested in extending our work to
LLRL with continuous MDPs. It is also interesting to investigate
the empirical and theoretical properties of Bayesian
approaches, such as Thompson sampling (Osband, Russo,
and Van Roy 2013), in lifelong RL. These algorithms allow
rich information to be encoded into a prior distribution, and
empirically are often effective at taking advantage of such
prior information.
A Proof for Proposition 1

For convenience, statements of theorems, lemmas and 
propositions from the main text will be repeated when they 
are proved in the appendix. 

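Proposition 1. For any $\delta \in (0,1)$, let
$E = \frac{1}{\mu_m} \ln \frac{1}{\delta \mu_m}$, where
$\mu_m = \min_{M \in \mathcal{M}} \mu(M)$. Then, with probability
$1 - \delta$,
$R(\text{ExpFirst}, T) \le \frac{\rho_1 - \rho_0}{\mu_m} \ln \frac{1}{\delta \mu_m}$.
Moreover, if $E = \frac{1}{\mu_m} \ln \frac{(\rho_3 - \rho_0) T}{\rho_1 - \rho_0}$,
then the expected regret is

\[ \bar{R}(\text{ExpFirst}, T) \le \frac{\rho_1 - \rho_0}{\mu_m} \left( \ln \frac{(\rho_3 - \rho_0) T}{\rho_1 - \rho_0} + 1 \right). \]

Proof. We start with the high-probability bound. Fix any
$M \in \mathcal{M}$. The probability that it is not sampled in the
first $E$ rounds can be bounded as follows:

\[
\begin{aligned}
\mathbb{P}\{M \notin \{M_1, \ldots, M_E\}\}
  &= (1 - \mu(M))^E \\
  &\le \exp(-\mu(M) E) && \text{(by inequality } 1 - x \le e^{-x}\text{)} \\
  &\le \exp(\ln(\delta \mu_m)) && \text{(by definition, } \mu_m \le \mu(M)\text{)} \\
  &= \delta \mu_m. && (3)
\end{aligned}
\]

Consequently, we have

\[ \mathbb{P}\{\exists M \in \mathcal{M},\ M \notin \{M_1, \ldots, M_E\}\} \le C \delta \mu_m \le \delta, \]

where the first inequality is due to Equation 3 and a union
bound applied to all $M \in \mathcal{M}$, and the second inequality
follows from the observation that $C \le 1/\mu_m$.

We have thus proved that, with probability at least $1 - \delta$,
all types in $\mathcal{M}$ will be sampled at least once in the first
$E$ rounds, and ExpFirst will have the minimal loss $\rho_0$ for all
$t > E$. Thus, with probability $1 - \delta$, we have

\[ L(\text{ExpFirst}, T) = \rho_2 C^* + \rho_1 (E - C^*) + \rho_0 (T - E), \quad (4) \]

where the first two terms correspond to loss incurred in the
first $E$ rounds, and the last term corresponds to loss incurred
in the remaining $T - E$ rounds. Subtracting the optimal loss
of Equation 1 from Equation 4 above gives the desired
high-probability regret bound:

\[ R(\text{ExpFirst}, T) = (\rho_1 - \rho_0)(E - C^*) \le (\rho_1 - \rho_0) E. \quad (5) \]

We now prove the expected regret bound. Since Equation 5
holds with probability at least $1 - \delta$, the expected
total regret of ExpFirst can be bounded as:

\[
\begin{aligned}
\bar{R}(\text{ExpFirst}, T)
  &\le (\rho_1 - \rho_0)(E - C^*) + (\rho_3 - \rho_0) \delta T \\
  &\le (\rho_1 - \rho_0) E + (\rho_3 - \rho_0) \delta T \\
  &\le \frac{\rho_1 - \rho_0}{\mu_m} \ln \frac{1}{\delta \mu_m} + (\rho_3 - \rho_0) \delta T. && (6)
\end{aligned}
\]

The right-hand side of the last equation is a function of $\delta$, in
the form of $f(\delta) := a - b \ln \delta + c \delta$, for
$a = \frac{\rho_1 - \rho_0}{\mu_m} \ln \frac{1}{\mu_m}$,
$b = \frac{\rho_1 - \rho_0}{\mu_m}$, and $c = (\rho_3 - \rho_0) T$.
Because of convexity of $f$, its minimum is found by solving
$f'(\delta) = 0$ for $\delta$, giving

\[ \delta^* = \frac{b}{c} = \frac{\rho_1 - \rho_0}{(\rho_3 - \rho_0) \mu_m T}. \]

Substituting $\delta^*$ for $\delta$ in Equation 6 gives the desired bound.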
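
As a numerical sanity check of the quantities in this proof, the following sketch evaluates the probing length $E = \frac{1}{\mu_m}\ln\frac{1}{\delta\mu_m}$, the union bound on the probability of missing some type, and the minimizer $\delta^* = b/c$ of $f(\delta) = a - b\ln\delta + c\delta$. The function names and the example values of $\rho_0, \rho_1, \rho_3$ and $T$ are illustrative assumptions, not values from the paper; the distribution is the one used in the human-robot experiment.

```python
import math

def probing_length(mu_min, delta):
    """E = (1 / mu_m) * ln(1 / (delta * mu_m)), as used in the proof."""
    return math.log(1.0 / (delta * mu_min)) / mu_min

def miss_probability_bound(mu, E):
    """Union bound on P(some type is never sampled in the first E rounds)."""
    return sum((1.0 - p) ** E for p in mu)

def regret_bound(delta, rho0, rho1, rho3, mu_min, T):
    """Right-hand side of Equation 6, i.e. f(delta) = a - b ln(delta) + c delta."""
    b = (rho1 - rho0) / mu_min
    a = b * math.log(1.0 / mu_min)
    c = (rho3 - rho0) * T
    return a - b * math.log(delta) + c * delta

def optimal_delta(rho0, rho1, rho3, mu_min, T):
    """Minimizer delta* = b / c of the convex function f."""
    return (rho1 - rho0) / ((rho3 - rho0) * mu_min * T)
```

For example, with `mu = (0.07, 0.31, 0.31, 0.31)` and `delta = 0.05`, `probing_length(0.07, 0.05)` is roughly 81 rounds, and the resulting miss probability is well below `delta`.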


B Proofs for ForcedExp 

This subsection gives complete proofs for theorems about
ForcedExp. We start with a few technical results that are
needed in the main theorem's proofs.

B.1 Technical Lemmas

The following general results are the key to obtaining our
expected regret bounds for ForcedExp.

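Lemma 7. Fix $M \in \mathcal{M}$, and let
$1 \le t_1 < t_2 < \cdots < t_m \le T$ be the rounds for which
$M_t = M$. Then, the expected total loss incurred in these
rounds is bounded as:

\[ L_M \le (m \rho_0 + \rho_2 - \rho_3) L_1 + (\rho_3 - \rho_0) L_2 + \rho_1 L_3, \]

where

\[
\begin{aligned}
L_1 &:= \sum_i \prod_{j<i} (1 - \eta_{t_j})\, \eta_{t_i}, \\
L_2 &:= \sum_i \prod_{j<i} (1 - \eta_{t_j})\, \eta_{t_i} \cdot i, \\
L_3 &:= \sum_i \Big( \prod_{j<i} (1 - \eta_{t_j})\, \eta_{t_i} \sum_{j>i} \eta_{t_j} \Big).
\end{aligned}
\]

Proof. Let $L_M(\text{ForcedExp})$ be the expected total loss
incurred in the rounds $t$ where $M_t = M$:
$1 \le t_1 < t_2 < \cdots < t_m \le T$ for some $m \ge 0$. Let
$I \in \{1, 2, \ldots, m, m+1\}$ be the random variable so that
$M$ is first discovered in round $t_I$. That is,

\[ A_{t_j} = \begin{cases} 0, & \text{if } j < I \\ 1, & \text{if } j = I. \end{cases} \]

Note that $I = m + 1$ means $M$ is never discovered; such a
notation is for convenience in the analysis below. The
corresponding loss is given by

\[ (I - 1) \rho_3 + \rho_2 + \sum_{j > I} \big( \rho_0 \mathbb{1}\{A_{t_j} = 0\} + \rho_1 \mathbb{1}\{A_{t_j} = 1\} \big), \]

whose expectation, conditioned on $I$, is at most

\[ (I - 1) \rho_3 + \rho_2 + \sum_{j > I} (\rho_0 + \rho_1 \eta_{t_j}). \]

Since ForcedExp chooses to probe in round $t$ with probability
$\eta_t$, we have that

\[ \mathbb{P}\{I = i\} = \prod_{j<i} (1 - \eta_{t_j})\, \eta_{t_i}. \]

Therefore, $L_M(\text{ForcedExp})$ can be bounded by

\[
\begin{aligned}
L_M(\text{ForcedExp})
  &\le \sum_i \mathbb{P}\{I = i\} \Big( (i - 1) \rho_3 + \rho_2 + \sum_{j > i} (\rho_0 + \rho_1 \eta_{t_j}) \Big) \\
  &= (m \rho_0 + \rho_2 - \rho_3) L_1 + (\rho_3 - \rho_0) L_2 + \rho_1 L_3,
\end{aligned}
\]

where $L_1$, $L_2$ and $L_3$ are given in the lemma statement.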
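
The three quantities in Lemma 7 are expectations under the discovery index $I$, whose law is $\mathbb{P}\{I = i\} = \prod_{j<i}(1-\eta_{t_j})\,\eta_{t_i}$. A minimal sketch for evaluating them numerically is given below; the function names are hypothetical, and `rates` is assumed to hold $\eta_{t_1}, \ldots, \eta_{t_m}$ in order.

```python
def discovery_pmf(rates):
    """Return [P{I=1}, ..., P{I=m}, P{I=m+1}] for probing rates eta_{t_1..t_m}.

    The last entry is the probability that the item is never discovered.
    """
    pmf, survive = [], 1.0
    for eta in rates:
        pmf.append(survive * eta)   # not probed before, probed now
        survive *= 1.0 - eta        # still undiscovered after this round
    pmf.append(survive)
    return pmf

def lemma7_terms(rates):
    """Compute (L1, L2, L3) from Lemma 7, summing over i = 1..m."""
    m = len(rates)
    pmf = discovery_pmf(rates)
    L1 = sum(pmf[:m])
    L2 = sum(pmf[i] * (i + 1) for i in range(m))
    L3 = sum(pmf[i] * sum(rates[i + 1:]) for i in range(m))
    return L1, L2, L3
```

With non-increasing rates, the computed values can be compared against the bounds $L_1 \le 1$, $L_2 \le 1/\eta_T$ and $L_3 \le \sum_j \eta_{t_j}$ that are used when the lemma is applied.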


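Now we can obtain the following proposition:

Proposition 8. If we run ForcedExp with non-increasing
exploration rates $\eta_1 \ge \cdots \ge \eta_T > 0$, then

\[ \mathbb{E}[L(\text{ForcedExp}, T)] \le \rho_0 T + \frac{C^* \rho_3}{\eta_T} + \rho_1 \sum_{t=1}^{T} \eta_t. \]

Proof. For each $M \in \mathcal{M}$, Lemma 7 gives an upper bound
of loss incurred in rounds $t$ for which $M_t = M$:

\[ L_M \le (m \rho_0 + \rho_2 - \rho_3) L_1 + (\rho_3 - \rho_0) L_2 + \rho_1 L_3, \]

where $L_1$, $L_2$ and $L_3$ are given in Lemma 7. We now bound
the three terms of $L_M(\text{ForcedExp})$, respectively.

To bound $L_1$, we define a random variable $I$, taking values
in $\{1, 2, \ldots, m, m+1\}$, whose probability mass function is
given by

\[ \mathbb{P}\{I = i\} = \begin{cases} \prod_{j<i} (1 - \eta_{t_j})\, \eta_{t_i}, & \text{if } i \le m \\ \prod_{j \le m} (1 - \eta_{t_j}), & \text{if } i = m + 1. \end{cases} \quad (7) \]

Therefore, $I$ is like a geometrically distributed random
variable, except that the parameter for the $i$th draw is not the
same and is $\eta_{t_i}$. Consequently,

\[ L_1 = \sum_i \mathbb{P}\{I = i\} \le 1. \]

To bound $L_2$, we use the same random variable $I$:

\[
\begin{aligned}
L_2 &= \sum_{i=1}^{m} \mathbb{P}\{I = i\} \cdot i \\
  &\le \sum_{i=1}^{m} \mathbb{P}\{I \ge i\} && \text{(Corollary of Theorem 3.2.1 of (Chung 2000))} \\
  &= \sum_{i=1}^{m} \prod_{j<i} (1 - \eta_{t_j}) && \text{(by definition of } I \text{ in Equation 7)} \\
  &\le \sum_{i=1}^{m} \prod_{j<i} (1 - \eta_T) && \text{(by assumption that } \eta_1 \ge \cdots \ge \eta_T\text{)} \\
  &= \frac{1}{\eta_T} \big( 1 - (1 - \eta_T)^m \big) \le \frac{1}{\eta_T}.
\end{aligned}
\]

To bound $L_3$, we have

\[ L_3 \le \sum_{i=1}^{m} \prod_{j<i} (1 - \eta_{t_j})\, \eta_{t_i} \sum_{j=1}^{m} \eta_{t_j} = L_1 \sum_{j=1}^{m} \eta_{t_j} \le \sum_{j=1}^{m} \eta_{t_j}. \]

Putting all three bounds above, we have

\[ L_M(\text{ForcedExp}) \le m \rho_0 + \frac{\rho_3}{\eta_T} + \rho_1 \sum_{j=1}^{m} \eta_{t_j}. \]

Now sum up all $L_M(\text{ForcedExp})$ over all $M \in \mathcal{M}$ that
appear in the sequence $(M_t)_t$, and we have

\[ \mathbb{E}[L(\text{ForcedExp}, T)] \le \rho_0 T + \frac{C^* \rho_3}{\eta_T} + \rho_1 \sum_{t=1}^{T} \eta_t. \]

■

B.2 Proof for Theorem 2

Theorem 2. Let $\eta_t = t^{-\alpha}$ (polynomial decaying rate) for
some parameter $\alpha \in (0,1)$. Then, for any given $\delta \in (0,1)$,

\[ R(\text{ForcedExp}, T) \le C^* \rho_3 \Big( T^{\alpha} \ln \frac{C^*}{\delta} + 1 \Big) \quad (8) \]

with probability $1 - \delta$. The expected regret is
$\bar{R}(\text{ForcedExp}, T) \le C^* \rho_3 T^{\alpha} + \frac{\rho_1}{1 - \alpha} T^{1 - \alpha}$.
Both bounds are $O(\sqrt{T})$ by choosing $\alpha = 1/2$.

Proof. The proof is split into two parts, for the two stated
bounds.

High-probability Regret Bound. Fix any $M \in \mathcal{M}$, and let
$1 \le t_1 < t_2 < \cdots < t_m \le T$ be the rounds for which
$M_t = M$. Then, for any $m' \le m$, we can upper-bound the
probability that $M$ remains undiscovered after the first $m'$
rounds for which $M_t = M$:

\[ \mathbb{P}\{M \text{ undiscovered after round } t_{m'}\} = \prod_{i=1}^{m'} (1 - \eta_{t_i}) \le \exp\Big( -\sum_{i=1}^{m'} \eta_{t_i} \Big), \]

where the inequality is due to the fact that $1 - x \le e^{-x}$. We
will show that for sufficiently large $m'$, the right-hand side
above, $\exp(-\sum_{i=1}^{m'} \eta_{t_i})$, is at most $\delta / C^*$; in
other words, with probability at least $1 - \delta / C^*$, item $M$
will be discovered after appearing $m'$ times for sufficiently
large $m'$. Indeed,

\[
\begin{aligned}
\sum_{i=1}^{m'} \eta_{t_i}
  &\ge \sum_{t = T - m' + 1}^{T} \eta_t && \text{(monotonicity of } (\eta_t)_t\text{)} \\
  &= \sum_{t = T - m' + 1}^{T} t^{-\alpha} && \text{(definition of } \eta_t\text{)} \\
  &\ge \int_{T - m' + 1}^{T} t^{-\alpha} \, dt \\
  &= \frac{1}{1 - \alpha} \big( T^{1 - \alpha} - (T - m' + 1)^{1 - \alpha} \big) \\
  &\ge \frac{1}{1 - \alpha} \cdot \frac{d}{dt} t^{1 - \alpha} \Big|_{t = T} \cdot \big( T - (T - m' + 1) \big) && \text{(concavity of } t^{1 - \alpha}\text{)} \\
  &= T^{-\alpha} (m' - 1).
\end{aligned}
\]

Therefore, the probability above is at most $\delta / C^*$ if
$T^{-\alpha} (m' - 1) \ge \ln \frac{C^*}{\delta}$, or equivalently,
$m' \ge T_0$, where

\[ T_0 := T^{\alpha} \ln \frac{C^*}{\delta} + 1. \]

It follows that, with probability at least $1 - \delta / C^*$, the
total loss associated with item $M$ (that is, the total loss
accumulated in $\{t_1, t_2, \ldots, t_m\}$) is at most

\[ \rho_3 \min\{m, T_0\} + \rho_0 (m - T_0)_+, \quad (9) \]

where $(x)_+ := \max\{x, 0\}$.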

Let $\mathcal{M}^* := \{M_1, \ldots, M_T\} \subseteq \mathcal{M}$ be the set of
types that appear in the sequence $(M_t)_t$. Clearly,
$C^* = |\mathcal{M}^*|$.

Summing Equation 9 over all $M \in \mathcal{M}^*$ and applying a
union bound, we have the following that holds with
probability at least $1 - \delta$:

\[
\begin{aligned}
L(\text{ForcedExp}, T)
  &\le \sum_{M \in \mathcal{M}^*} \big( \rho_3 \min\{m(M), T_0\} + \rho_0 (m(M) - T_0)_+ \big) \\
  &\le C^* \rho_3 T_0 + \sum_{M \in \mathcal{M}^*} \rho_0 (m(M) - T_0)_+, && (10)
\end{aligned}
\]

where $m(M) := |\{1 \le t \le T \mid M_t = M\}|$ is the number of
times $M$ appears in $T$ rounds. Now consider the optimal yet
hypothetical strategy, whose total loss, given in Equation 1,
can be written as

\[ L^*(T) = \sum_{M \in \mathcal{M}^*} \big( \rho_2 + \rho_0 (\min\{m(M), T_0\} - 1) + \rho_0 (m(M) - T_0)_+ \big). \quad (11) \]

In Equation 11, for each $M \in \mathcal{M}^*$, the first two terms
correspond to the loss accumulated in the first
$\min\{m(M), T_0\}$ times where $M_t = M$, and the last term to
the remaining rounds where $M_t = M$. It then follows from
Equations 10 and 11 that, with probability at least $1 - \delta$,

\[ R(\text{ForcedExp}, T) \le C^* \rho_3 T_0 = C^* \rho_3 \Big( T^{\alpha} \ln \frac{C^*}{\delta} + 1 \Big), \]

as stated in the theorem.

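Expected Regret Bound. Given polynomial exploration
rates $\eta_t = t^{-\alpha}$, we have

\[
\begin{aligned}
\sum_{t=1}^{T} \eta_t
  &= 1 + \sum_{t=2}^{T} t^{-\alpha} \\
  &\le 1 + \int_{1}^{T} t^{-\alpha} \, dt \\
  &= 1 + \frac{1}{1 - \alpha} \big( T^{1 - \alpha} - 1 \big) \\
  &\le \frac{1}{1 - \alpha} T^{1 - \alpha}.
\end{aligned}
\]

The total regret follows immediately from Proposition 8.
Furthermore, if one sets $\alpha = 1/2$, the regret bound becomes

\[ (C^* \rho_3 + 2 \rho_1) \sqrt{T} = O(\sqrt{T}). \]

■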
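
To make the analyzed procedure concrete, here is a minimal simulation sketch of ForcedExp on an online coupon-collector stream, using the loss structure that appears in the proof of Lemma 7: $\rho_3$ for skipping an undiscovered item, $\rho_2$ for the probe that discovers it, and $\rho_0$ or $\rho_1$ for skipping or probing an already discovered item. The function names, the particular loss values, and the uniformly random stream in the usage note are assumptions made for illustration only.

```python
import random

def forced_exp(stream, alpha, rho, rng):
    """Run ForcedExp with probing rate eta_t = t**(-alpha) on `stream`.

    rho = (rho0, rho1, rho2, rho3). Returns (total_loss, discovered_items).
    """
    rho0, rho1, rho2, rho3 = rho
    discovered = set()
    loss = 0.0
    for t, item in enumerate(stream, start=1):
        probe = rng.random() < t ** (-alpha)
        if probe:
            if item in discovered:
                loss += rho1            # probing a known item
            else:
                loss += rho2            # probing discovers a new item
                discovered.add(item)
        else:
            loss += rho0 if item in discovered else rho3
    return loss, discovered

def optimal_loss(stream, rho):
    """Loss of the hypothetical optimal strategy: rho2*C* + rho0*(T - C*)."""
    c_star = len(set(stream))
    return rho[2] * c_star + rho[0] * (len(stream) - c_star)
```

A typical use is to draw a stream such as `[rng.randrange(4) for _ in range(400)]`, run `forced_exp` with `alpha = 0.5`, and compare the average of `loss - optimal_loss(stream, rho)` over several runs with the expected-regret bound $C^*\rho_3 T^{\alpha} + \frac{\rho_1}{1-\alpha} T^{1-\alpha}$.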


C Proof for Theorem 3 

Theorem 3. There exists an OCCP where every admissible
algorithm has an expected regret of $\Omega(\sqrt{T})$, and for
sufficiently small $\delta$, the regret is $\Omega(\sqrt{T})$ with
probability $1 - \delta$.

Proof. We construct a stochastic OCCP with
$\mathcal{M} = \{1, 2, \ldots, C\}$ and distribution $\mu$ so that

\[ \mu(M) = \begin{cases} \mu_m, & \text{if } M < C \\ 1 - (C - 1) \mu_m, & \text{if } M = C, \end{cases} \]

where $\mu_m = 1/\sqrt{T} \ll 1$. For every $M \in \mathcal{M}$, define
$\tau_M \in \{1, \ldots, T, \infty\}$ as the first time $M$ is
collected; that is

\[ \tau_M := \min\{t \mid M_t = M,\ A_t = \mathrm{P}\}, \]

with the convention that $\tau_M = \infty$ if
$\{t \mid M_t = M,\ A_t = \mathrm{P}\} = \emptyset$. Furthermore, let
$1 \le t_1 < t_2 < \cdots < t_E \le T$ be the rounds in which probing
($A_t = \mathrm{P}$) happens; denote by $\mathcal{E}$ the set
$\{t_1, t_2, \ldots, t_E\}$. Since the two random variables $M_t$
and $A_t$ are independent, we have for any $i \in \{1, 2, \ldots, E\}$
and any $M \in \mathcal{M}$ that

\[ \mathbb{P}\{M_{t_i} = M\} = \mu(M). \]

We start with the expected-regret lower bound and let $\mathbf{A}$
be any admissible algorithm. Conditioning on $\mathcal{E}$ being the
rounds of probing, we want to lower bound the number $E$
of exploration rounds so that the probability of not discovering
all items in $\mathcal{M}$ is at most $\delta$ (which is necessary for
the expected regret to be $O(\sqrt{T})$). First, note that the events
$\{\tau_M < \infty\}_{M \in \mathcal{M}}$ are negatively correlated, since
discovering some $M_1$ in $\mathcal{E}$ can only decrease the
probability of discovering $M_2 \ne M_1$ in $\mathcal{E}$. Therefore,
we have

\[
\begin{aligned}
\mathbb{P}\{\forall M,\ \tau_M < \infty\}
  &\le \prod_{M \in \mathcal{M}} \mathbb{P}\{\tau_M < \infty\} \\
  &\le \prod_{M=1}^{C-1} \mathbb{P}\{\tau_M < \infty\} \\
  &\le \big( 1 - (1 - \mu_m)^E \big)^{C-1}.
\end{aligned}
\]

Making the last expression equal to $1 - \delta$, we have

\[ E = \frac{\ln\big( 1 - (1 - \delta)^{\frac{1}{C-1}} \big)}{\ln(1 - \mu_m)} = \Omega\Big( \frac{1}{\mu_m} \ln \frac{1}{\delta} \Big) \]

for sufficiently small $\mu_m$ and $\delta$.

For simplicity, assume $\rho_0 = 0$ without loss of generality;
otherwise, we can just define a related problem with
$\rho'_i := \rho_i - \rho_0$, where the loss is just shifted by a
constant and the regret remains unchanged. With this assumption,
the optimal expected loss given in Equation 1 becomes
$L^*(T) = C^* \rho_2$. With $\rho_0 = 0$, the expected loss of
$\mathbf{A}$ is at least
$(E - C^*) \rho_1 + C^* \rho_2 + \delta (T - E) \mu_m \rho_3$, where the
first two terms are for the loss incurred during the $E$ probing
rounds, and the last term is for the $\delta$-probability event that
some item is not discovered in the probing rounds, which leads to
$\rho_3$ loss when it is encountered in any of the remaining
$T - E$ rounds.

The regret of $\mathbf{A}$, by comparing its loss to $L^*(T)$, can be
lower bounded by

\[ (E - C^*) \rho_1 + \delta (T - E) \mu_m \rho_3 = \Omega\Big( \rho_3 T \mu_m \delta + \frac{\rho_1}{\mu_m} \ln \frac{1}{\delta} \Big), \]

giving an expected-regret lower bound by observing the fact
that $\mu_m = 1/\sqrt{T}$.

The high-probability lower bound can be proved by very
similar calculations, with the observation that all $C$ types
need to be collected in order to have a regret bound that
holds with probability $1 - \delta$, for sufficiently small $\delta$. ■



D Algorithm Pseudocode 

The following algorithm, PAC-Explore of Guo and Brunskill
(2015), is a key component in Algorithm 1. It takes as
input two parameters: a threshold m for determining whether a
state-action pair is known, and a planning horizon L that is
used to compute an exploration policy.

Algorithm 2 PAC-Explore of Guo and Brunskill (2015)

Input: m (known threshold), L (planning horizon)
while some (s, a) has not been visited at least m times do
    Let s be the current state
    if all a have been tried m times in s then
        Start a new L-step episode
        Construct an empirical known-state MDP $\hat{M}_K$
            with the reward of all known (s, a) pairs set to
            0, all unknown set to 1 (maximum reward value),
            the transition model of all known (s, a) pairs set to
            the estimated parameters and the unknown to self
            loops
        Compute an optimistic L-step policy $\pi$ for $\hat{M}_K$
        From the current state, follow $\pi$ for L steps, or until
            an unknown state is reached
    else
        Execute a that has been tried the least
    end if
end while
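
The pseudocode above can be turned into a small tabular sketch. The version below is an illustrative reading of Algorithm 2, not the implementation of Guo and Brunskill (2015): the function names, the list-based MDP representation, the `max_steps` safety cap, and the use of plain finite-horizon value iteration for the optimistic L-step policy are all assumptions. The true transition table `P` is used only to simulate the environment.

```python
import random

def random_mdp(S, A, rng):
    """Hypothetical helper: dense random transition table P[s][a][s']."""
    P = []
    for _ in range(S):
        rows = []
        for _ in range(A):
            w = [rng.random() + 0.1 for _ in range(S)]
            total = sum(w)
            rows.append([x / total for x in w])
        P.append(rows)
    return P

def pac_explore(P, start, m, L, rng, max_steps=100_000):
    """Visit every (s, a) at least m times; returns (visit counts, steps)."""
    S, A = len(P), len(P[0])
    counts = [[0] * A for _ in range(S)]
    trans = [[[0] * S for _ in range(A)] for _ in range(S)]
    steps = 0

    def step(s, a):
        nonlocal steps
        s2 = rng.choices(range(S), weights=P[s][a])[0]
        counts[s][a] += 1
        trans[s][a][s2] += 1
        steps += 1
        return s2

    def state_known(s):
        return min(counts[s]) >= m

    def plan():
        # Finite-horizon value iteration on the empirical known-state MDP:
        # known pairs have reward 0 and empirical transitions,
        # unknown pairs have reward 1 and a self loop.
        V = [0.0] * S
        policy = [[0] * S for _ in range(L)]
        for h in reversed(range(L)):
            new_V = [0.0] * S
            for u in range(S):
                best_q, best_a = -1.0, 0
                for a in range(A):
                    if counts[u][a] >= m:
                        q = sum(trans[u][a][v] / counts[u][a] * V[v]
                                for v in range(S))
                    else:
                        q = 1.0 + V[u]
                    if q > best_q:
                        best_q, best_a = q, a
                new_V[u], policy[h][u] = best_q, best_a
            V = new_V
        return policy

    s = start
    while not all(state_known(u) for u in range(S)) and steps < max_steps:
        if state_known(s):
            policy = plan()                 # start a new L-step episode
            for h in range(L):
                if not state_known(s):      # reached an unknown state
                    break
                s = step(s, policy[h][s])
        else:
            s = step(s, counts[s].index(min(counts[s])))  # least-tried action
    return counts, steps
```

On a small communicating MDP such as `random_mdp(4, 2, rng)`, calling `pac_explore(P, start=0, m=5, L=6, rng=rng)` returns visit counts in which every state-action pair has been tried at least five times.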

E Proofs for LLRL Sample Complexity 

This section provides details of the sample-complexity
analysis of Algorithm 1, leading to the main result of Theorem 4.

E.1 Proof of Lemma 5

Lemma 5. For a given MDP, PAC-Explore with input
$m \ge m_0$ and $L = 3D$ will visit all state-action pairs at
least $m$ times in no more than $H_0(m)$ steps with probability
$1 - \delta$, where $m_0 = O(N D^2 \log \cdots)$ is some constant.

Proof. The proof follows closely that of Guo and Brunskill
(2015). Consider the beginning of an episode, and let
$K$ be the set of known state-action pairs which have been
visited by the agent at least $m$ times. For each $(s, a) \in K$,
the $\ell_1$ distance between the empirical estimate and the
actual next-state distribution is at most (Lemma 8.5.5 of
Kakade (2003)): $\alpha = O\big(\sqrt{\tfrac{N}{m} \log \cdots}\big)$.
Let $M_K$ be the known-state MDP, which is identical to
$\hat{M}_K$ except that the transition probabilities are replaced
by the true ones for known state-action pairs. Following the
same line of reasoning as Guo and Brunskill (2015), one may
lower-bound the probability that an unknown state is reached
within the episode by $p_e \ge 1/6 - 3 \alpha D$. Therefore, $p_e$ is
bounded below by $1/12$ as long as $\alpha D \le 1/36$. The latter is
guaranteed if $m \ge m_0 = O(N D^2 \log \cdots)$. The rest of the
proof is the same as Guo and Brunskill (2015), invoking
Lemma 56 of Li (2009) to get an upper bound of $H$, stated in
the lemma as $H_0(m)$. ■


E.2 Proof of Lemma 6 

Lemma 6. With input parameters $H \ge H_0(m)$ and
$m = 72 N \log(\cdots) \max\{\Gamma^{-2}, D^2\}$ in Algorithm 1, the
following holds with probability $1 - 2\delta$: for every task in the
sequence, the algorithm detects it is a new task if and only if
the corresponding MDP has not been seen before.

Proof. For task $M_t$, let $\mathcal{E}_t$ be the event that all
state-action pairs become known after $H$ steps; Lemma 5 with a
union bound shows that all events
$\{\mathcal{E}_t\}_{t \in \{1, 2, \ldots, T\}}$ hold with probability
at least $1 - \delta$. For every fixed $t$, under event $\mathcal{E}_t$,
every state-action pair has at least $m$ samples to estimate its
transition probabilities and average reward after $H$ steps.
Applying Lemma 8.5.5 of Kakade (2003) on the transition
distribution, we can upper bound, with probability at least
$1 - \frac{\delta}{2SAT}$, the $\ell_1$ error of the transition
probability estimates
by: 



Similarly, an application of Hoeffding's inequality gives the
following upper bound, with probability at least
$1 - \frac{\delta}{2SAT}$, on the reward estimate:



Applying a union bound over all states, actions, and tasks,
the above concentration results hold with probability at least
$1 - \delta$ for an agent running on $T$ tasks. The rest of the proof
is to show that task identification succeeds when the above
concentration inequalities hold.

To do this, consider the following two mutually exclusive
cases:

1. If $M_t$ is new, then, by assumption, for every
$M' \in \mathcal{M}$, there exists some $(s, a)$ for which the two
models differ by at least $\Gamma$ in $\ell_2$ distance; that is,
$\|\theta_{M_t}(\cdot|s,a) - \theta_{M'}(\cdot|s,a)\|_2 > \Gamma$.
It follows from the equality,

\[
\begin{aligned}
\|\theta_{M_t}(\cdot|s,a) - \theta_{M'}(\cdot|s,a)\|_2^2
  &= \sum_{1 \le s' \le S} \big( \theta_{M_t}(s'|s,a) - \theta_{M'}(s'|s,a) \big)^2 && \text{(error in transition probability estimates)} \\
  &\quad + \big( \theta_{M_t}(S+1|s,a) - \theta_{M'}(S+1|s,a) \big)^2, && \text{(error in reward estimate)}
\end{aligned}
\]

that at least one of the two terms on the right-hand side above
is at least $\Gamma^2/2$.

If the first term is larger than $\Gamma^2/2$, then the $\ell_1$
distance between the two next-state transition distributions is at
least $\Gamma/\sqrt{2}$, which is larger than
$2\epsilon_T = 2\Gamma/3$. It implies that the $\ell_1$-balls of
transition probability estimates for $(s, a)$ between $M_t$ and
$M'$ do not overlap, and we will identify $M_t$ as a new MDP.
Similarly, if the second term is larger than $\Gamma^2/2$, then by
the same argument applied to the reward estimates we can still
identify $M_t$ as a new MDP.

2. If $M_t$ is not new, we claim that the algorithm will correctly
identify it as some previously solved MDP, say
$M'' \in \hat{\mathcal{M}}$. In particular, confidence intervals of
its estimated model in every state-action pair must overlap
with those of $M''$, since both models' confidence intervals
contain the true model parameters. On the other hand, for any
$M' \in \hat{\mathcal{M}} \setminus \{M''\}$, its model estimate's
confidence intervals do not overlap with those of $M_t$ in at
least one state-action pair, as shown in case 1. Therefore, the
algorithm can find the unique and correct
$M'' \in \hat{\mathcal{M}}$ that is the same as $M_t$.

Finally, the lemma is proved with a union bound over all
tasks, states and actions, and with the probability that
$\mathcal{E}_t$ fails to hold for some $t$. ■
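
The identification step argued in this proof amounts to an overlap test between confidence balls. The sketch below illustrates that test under simplifying assumptions: models are stored as dictionaries mapping a state-action pair to a tuple of next-state probabilities and a mean reward, both estimates share the same radii `eps_T` and `eps_R`, and two balls overlap exactly when their centers are within twice the radius. The function names and data layout are hypothetical.

```python
def l1_distance(p, q):
    """L1 distance between two next-state distributions given as sequences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def identify_task(estimate, library, eps_T, eps_R):
    """Return the index of the library model consistent with `estimate`,
    or None if the task looks like a new MDP.

    A library model is consistent if, for every (s, a), the confidence
    balls of the transition and reward estimates overlap.
    """
    for k, model in enumerate(library):
        consistent = True
        for sa, (probs, reward) in estimate.items():
            m_probs, m_reward = model[sa]
            if l1_distance(probs, m_probs) > 2 * eps_T:
                consistent = False      # transition balls are disjoint
                break
            if abs(reward - m_reward) > 2 * eps_R:
                consistent = False      # reward intervals are disjoint
                break
        if consistent:
            return k
    return None
```

When the radii are small relative to the gap between distinct models (the proof uses $2\epsilon_T = 2\Gamma/3$), at most one library model can pass this test, so returning the first match is unambiguous.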

E.3 Proof of Theorem 4 

Theorem 4. Let Algorithm 1 with proper choices of parameters
be run on a sequence of $T$ tasks, each from a set
$\mathcal{M}$ of $C$ MDPs. Then, with prob. $1 - \delta$, the number of
steps in which the algorithm is not $\epsilon$-optimal across all
$T$ tasks is
$\tilde{O}\big( \rho_0 T + C \rho_3 \sqrt{T} \ln \frac{C}{\delta} \big)$,
where $\rho_0 = CD/\Gamma^2$ and $\rho_3 = H$.

Proof. We consider each possible case when solving the $t$th
task, $M_t$. As shown in Lemma 6, with probability $1 - \delta$, the
following event $\mathcal{E}_t$ holds for all $t \in [T]$: after
PAC-Explore is run on $M_t$, Algorithm 1 will discover the
identity of $M_t$ correctly. That is, if $M_t$ is a new MDP, it will
be added to $\hat{\mathcal{M}}$; otherwise, $\hat{\mathcal{M}}$
remains unchanged. In the following, we assume $\mathcal{E}_t$ holds
for every $t$, and consider the following cases:

(a) Exploitation in discovered tasks: we choose to exploit
(line 12 in Alg 1) and $M_t$ has been already discovered.
In this case, Finite-Model-RL is used to do model elimination
(within $\hat{\mathcal{M}}$) and to transfer samples from previous
tasks that correspond to the same MDP as the current task
$M_t$. Therefore, with a similar analysis, we can get a per-task
sample complexity of at most
$\tilde{O}(CDm) = \tilde{O}(CD/\Gamma^2) = \rho_0$.

(b) Exploitation in undiscovered tasks: we choose to exploit
and $M_t$ has not been discovered. Running Finite-Model-RL
in this case can end up with an arbitrarily poor
policy which follows a non-$\epsilon$-optimal policy in every step.
Therefore, the sample complexity can be as large as $H = \rho_3$.

(c) Exploration: we choose to explore using PAC-Explore
(lines 6-9 in Alg 1). In this case, with high probability,
it takes at most $H_0(m)$ steps to make every state
known, so that the model parameters can be estimated to
within accuracy $O(\Gamma)$. After that, we can reliably decide
whether $M_t$ is a new MDP or not. With sample transfer, the
number of additional steps where $\epsilon$-sub-optimal policies are
taken in the MDP corresponding to $M_t$ (accumulated across all
tasks in the $T$-sequence) is at most $\zeta_s$, the single-task
sample complexity. The total sample complexity for tasks
corresponding to this MDP is therefore at most
$H_0(m) T(M_t) + \zeta_s = \rho_2 T(M_t) + \zeta_s$, where
$T(M_t)$ is the number of times this MDP occurs in the
$T$-sequence.

Finally, when Algorithm 1 is run on a sequence of $T$ tasks,
the total sample complexity (the number of steps in all tasks
for which the agent does not follow an $\epsilon$-optimal policy)
is given by one of the three cases above. The sample
complexity of exploration can therefore be upper bounded by
adding Equation 1 to Equation 8 in Theorem 2, completing
the proof with an application of a union bound that takes care
of error probabilities (those involved in Lemma 6, in
upper-bounding sample complexity in individual tasks in the proof
above, and in Theorem 2). ■

F Experiment Details

F.1 Gridworld

For the gridworld domain, all four MDPs had the same 25-cell
square grid layout and 4 actions (up, down, left, right).
State s1 is in the upper left hand corner, state s5 is the
upper right hand corner, s20 is the lower left hand corner, and
s25 is the lower right hand corner. All other states are
labeled sequentially between these. Actions succeed in their
intended direction with probability 0.85 and with probability
0.05 go in each of the other three directions (unless halted
by a wall, in which case the agent stays in the same state).
In corner states s5, s20, and s25, the agent stays in the same
state with probability 0.95 or transitions back to the start
state (for all actions). The start state is at the center of
the square grid (s13). The dynamics of all MDPs are identical.
All rewards are sampled from Bernoulli distributions. All
rewards have parameter 0.0 unless otherwise noted:

In MDP 1, corner state s20 has a reward parameter of 0.75. 
In MDP 2, corner state s5 has a reward parameter of 0.75. 
In MDP 3, corner state s25 has a reward parameter of 0.75. 
In MDP 4, corner state s25 has a reward parameter of 0.75,
and corner state s1 has a reward parameter of 0.99.
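
The shared gridworld dynamics described above can be sketched as a transition function. This is an illustrative reconstruction, not the authors' simulator: cells are addressed by (row, column) rather than by the s1..s25 labels, the three sticky corners are taken to be the upper-right, lower-left and lower-right cells, and all names are hypothetical.

```python
GRID = 5
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
START = (2, 2)                                # center of the 5 x 5 grid
STICKY_CORNERS = {(0, 4), (4, 0), (4, 4)}     # assumed (row, col) of the three corners

def move(cell, direction):
    """Deterministic move; the agent stays put when halted by a wall."""
    r, c = cell
    dr, dc = ACTIONS[direction]
    nr, nc = r + dr, c + dc
    return (nr, nc) if 0 <= nr < GRID and 0 <= nc < GRID else cell

def transition(cell, action):
    """Return {next_cell: probability} for taking `action` in `cell`."""
    if cell in STICKY_CORNERS:
        # Same for all actions: stay with 0.95, otherwise back to the start.
        return {cell: 0.95, START: 0.05}
    dist = {}
    for direction in ACTIONS:
        p = 0.85 if direction == action else 0.05
        nxt = move(cell, direction)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist
```

For instance, `transition((2, 2), "up")` puts probability 0.85 on the cell above the center and 0.05 on each of the other three neighbours, while in the upper-left corner the probability mass of moves blocked by walls is folded into staying in place.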

ExpFirst is given an upper bound on the number of
MDPs (4) and the minimum probability of any of the MDPs
across the 100 tasks. When we compared to the Bayesian
hierarchical multi-task learning algorithm HMTL for the
stochastic setting, we also provided it with an upper bound
on the number of MDPs, though HMTL is also capable
of learning this directly. We used HMTL with a two-level
hierarchy (e.g. a class consists of a single MDP). We ran
a variant of ForcedExp, labeled "ForcedExp" in the figures,
which uses a polynomially decaying exploration rate,
$\eta_t = t^{-\alpha}$ with $\alpha = 0.5$, for all experiments.
Performance does vary with the choice of $\alpha$, but
$\alpha = 0.5$ gave good results in our preliminary
investigations. Interestingly, this is consistent with the
theoretical result that $\alpha = 0.5$ minimizes dependence on
$T$ for polynomially decaying exploration rates (cf. Theorem 2).

We also explored the ForcedExp algorithm using a constant
exploration rate for some earlier experiments: as expected,
performance was similar to but generally slightly worse than
with a decaying exploration rate, and so we focus
all comparisons on the decaying exploration rate variant.

F.2 Simulated Human-Robot Collaboration 

Our abstracted human-robot collaboration simulation comes
from the recent work of Nikolaidis et al. (2015). The authors
showed significant benefits in a human-robot collaboration
problem, by assuming that user preference models
over human-robot collaboration could be clustered into a
small set of types. In their work, they took a previously
collected set of data, and clustered it using the
Expectation-Maximization (EM) algorithm into a set of 4 user types.
Then, for each new user, they treated the problem as a mixed
observability Markov decision process, where the (static)
hidden state is the type of the user. In contrast to their work,
we handle online lifelong learning across tasks, and our central
contribution is a formal analysis of the sample complexity
and performance, whereas Nikolaidis et al. (2015)
present exciting empirical results on real human-robot
interactions, without a theoretical analysis.

To demonstrate that our approach could also achieve good
performance for this setting, we performed simulation
experiments by constructing a lifelong learning domain in
which each task is sampled from the four MDP models
learned by Nikolaidis et al. (2015).^1

The domain involves a human and robot collaborating to
paint a box. The box is defined by its location along the
horizontal (5 positions) and vertical (11 positions) axes, as
well as its tilt angle (11 values), for a total of 605 states.
The possible actions of the robot are to change each of the
three dimensions of the box's location (to stay the same
or move forward or backward along that axis), resulting in
3^3 = 27 actions. The transition dynamics are deterministic
and identical for all 4 MDP models. The MDP models
differ in their (deterministic) reward models. Nikolaidis
et al. (2015) learned the MDP models using the EM algorithm
and inverse reinforcement learning from a set of 15
humans performing 4 different variants of the human-robot
box painting task (varying by which position the human
performed the task in) where the human annotates actions for
the robot.^2 We introduced a small amount of Gaussian noise
(with 0.01 standard deviation and zero mean) to the rewards.
Note that even if the models are known to be deterministic,
an agent learning with no prior information must still visit
all S × A = 16335 state-action pairs at least once to learn
their dynamics, and of course it is not always possible to
directly reach any other state in a single action.

In our simulation, for each task one of the 4 MDPs was
randomly selected, and the agent executed in it for H steps
without a priori knowledge of its identity. We set the horizon
length to H = 3SA = 49005, so that it was feasible to visit
all state-action pairs at least once.

We tested our ExpFirst algorithm on this domain with
a total of 80 tasks per "run". We report results
averaged over 30 runs.


^1 We thank those authors for sharing their models.

^2 Inverse reinforcement learning is then used to infer a reward
function of the human user that would make the actions prescribed
by the human for the robot optimal.



